Implement multithreading in qgemm_kleidi #26301
|
@microsoft-github-policy-service agree company="Arm" |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
Can we get the workflows run, please? |
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
|
General sanity check question: are there enough tests, covering multiple batch sizes and M and N dimensions, to exercise every code path of the multi-threaded implementation? |
|
Will trigger CI once you push commits addressing the PR feedback (right now I only see a rebase). Thanks. |
We checked the existing tests for qgemm. In the current implementation, the tests only run with thread pool = null. We created a follow-up ticket for test coverage.
If all the tests run with ThreadPool == null, does that mean the new thread-pool-based parallel code paths are not exercised?
It means they were not exercised in the onnxruntime_mlas_test run, but they are exercised by onnxruntime_perf_test. However, unit tests for the multithreaded code have now been added in the latest commit, so both can now use multiple threads.
Signed-off-by: melkap01 <[email protected]>
Removed an unused variable, removed the unnecessary temp_tile use and copy, and added a check for the K==0 case. Signed-off-by: melkap01 <[email protected]>
Signed-off-by: melkap01 <[email protected]>
Signed-off-by: melkap01 <[email protected]>
Signed-off-by: Jonathan Clohessy <[email protected]>
64c59e5 to 3fcba09
Signed-off-by: Jonathan Clohessy <[email protected]>
27d570c to 0c60f4e
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Signed-off-by: Jonathan Clohessy <[email protected]>
Pull request overview
This PR implements multithreading and tiling for Dynamic Quantized GEMM operations using KleidiAI kernels to improve performance on ARM64 SME/SME2 architectures. The changes introduce thread-local buffers for memory reuse during inference and update KleidiAI to version 1.15.0.
Changes:
- Refactored dynamic quantization matrix multiplication to use thread-local buffers and parallel tiling across batch, M, and N dimensions
- Moved KleidiAI packing logic from operator-specific code to a reusable base class
- Extended test coverage to include single-threaded and multi-threaded test suites with edge cases
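
To illustrate what "parallel tiling across batch, M, and N dimensions" can look like, here is a minimal sketch. It is not the code from this PR: `Tile`, `TileFromIndex`, `TiledGemm`, the step sizes, and the use of `std::thread` in place of the MLAS thread pool are all assumptions made for illustration, and a plain reference loop stands in for the KleidiAI kernel. It assumes `m_step` and `n_step` are non-zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical tile descriptor for this sketch; not an MLAS or KleidiAI type.
struct Tile {
    std::size_t batch, m_start, m_count, n_start, n_count;
};

inline std::size_t CeilDiv(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Decompose a flat work index into (batch, M tile, N tile). Edge tiles are
// clipped so that m_count/n_count never run past M or N.
inline Tile TileFromIndex(std::size_t idx, std::size_t M, std::size_t N,
                          std::size_t m_step, std::size_t n_step) {
    const std::size_t tiles_n = CeilDiv(N, n_step);
    const std::size_t per_batch = CeilDiv(M, m_step) * tiles_n;
    const std::size_t rem = idx % per_batch;
    Tile t;
    t.batch = idx / per_batch;
    t.m_start = (rem / tiles_n) * m_step;
    t.m_count = std::min(m_step, M - t.m_start);
    t.n_start = (rem % tiles_n) * n_step;
    t.n_count = std::min(n_step, N - t.n_start);
    return t;
}

// Reference int8 GEMM: C[b] (M x N, int32) = A[b] (M x K) * B (K x N).
// Each tile is written by exactly one worker, so no locking is needed.
inline void TiledGemm(const std::vector<std::int8_t>& A, const std::vector<std::int8_t>& B,
                      std::vector<std::int32_t>& C, std::size_t batch, std::size_t M,
                      std::size_t N, std::size_t K, std::size_t m_step, std::size_t n_step,
                      unsigned num_threads) {
    C.assign(batch * M * N, 0);
    if (batch == 0 || M == 0 || N == 0 || K == 0) return;  // nothing to accumulate
    const std::size_t total = batch * CeilDiv(M, m_step) * CeilDiv(N, n_step);
    num_threads = std::max(1u, num_threads);
    auto worker = [&](unsigned tid) {
        // Strided assignment: worker tid handles tiles tid, tid + T, tid + 2T, ...
        for (std::size_t idx = tid; idx < total; idx += num_threads) {
            const Tile t = TileFromIndex(idx, M, N, m_step, n_step);
            for (std::size_t m = t.m_start; m < t.m_start + t.m_count; ++m) {
                for (std::size_t n = t.n_start; n < t.n_start + t.n_count; ++n) {
                    std::int32_t sum = 0;
                    for (std::size_t k = 0; k < K; ++k) {
                        sum += std::int32_t(A[(t.batch * M + m) * K + k]) *
                               std::int32_t(B[k * N + n]);
                    }
                    C[(t.batch * M + m) * N + n] = sum;
                }
            }
        }
    };
    if (num_threads == 1) { worker(0); return; }  // single-threaded path, no spawn
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
}
```

Because every output element belongs to exactly one tile, running with one thread and with several threads should give identical results, which is the property a multi-threaded unit test can check across odd M and N sizes and the K == 0 case.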
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/unittest/test_dynamic_qgemm.cpp | Splits tests into single-thread and thread-pool variants, adds proper quantization simulation and edge case handling |
| onnxruntime/test/contrib_ops/dynamic_quantize_matmul_test.cc | Adds KleidiAI-specific tests for bias handling, zero-point validation, and fallback scenarios |
| onnxruntime/core/providers/cpu/quantization/matmul_integer_base.h | Extracts KleidiAI prepacking logic into reusable helper methods in the base class |
| onnxruntime/core/mlas/lib/qgemm.cpp | Updates availability check to include both SME and SME2 |
| onnxruntime/core/mlas/lib/kleidiai/qgemm_kleidiai.cpp | Implements multi-threaded tiling with thread-local buffers and adds input validation |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Adds UseSME flag alongside existing UseSME2 |
| onnxruntime/contrib_ops/cpu/quantization/dynamic_quantize_matmul.cc | Simplifies by delegating prepacking to base class and removes duplicate code |
|
Please rebase with main and the CUDA / TensorRT issues should go away |
|
May have some conflicts with #26849 |
Signed-off-by: Jonathan Clohessy <[email protected]>
|
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 4 pipeline(s). |
Key changes
This PR improves the performance of Dynamic Qgemms by implementing tiling and threading across operations.
It introduces thread-local buffers that reuse memory during inference and uses them in Dynamic Quantised Matmul operations with KleidiAI kernels.
It also updates KleidiAI to version 1.15.0.
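
As a rough illustration of the thread-local buffer idea, the sketch below keeps one scratch buffer per thread and grows it only when a larger size is requested, so repeated calls on the same thread reuse the same allocation. The helper name `GetThreadScratch` and the grow-only policy are assumptions for this example, not the PR's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns this thread's scratch buffer, holding at least
// `bytes` bytes. The buffer is created on first use per thread and is only
// ever grown, so smaller or equal requests reuse the existing allocation.
inline std::uint8_t* GetThreadScratch(std::size_t bytes) {
    thread_local std::vector<std::uint8_t> buffer;
    if (buffer.size() < bytes) {
        buffer.resize(bytes);  // may reallocate; earlier pointers become invalid
    }
    return buffer.data();
}
```

A worker would request scratch at the start of each tile (for example, for a packed LHS block) and avoid a heap allocation per call once the buffer has reached its steady-state size. The trade-off of a grow-only buffer is that each thread holds on to its largest allocation until the thread exits.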
Example performance
Single thread:

2 threads:
